Design for Failure #

Assuming things will fail, carefully review every aspect of your cloud architecture and design for the failure scenarios that apply to each part of it. In particular, assume hardware will fail, cloud data center outages will happen, databases will fail or degrade in performance, expected transaction volumes will be exceeded, and so on. In an auto-scaled environment, for example, nodes may be shut down when load returns to normal after a spike, or rebooted by the cloud platform, and there can also be unexpected application failures. In all cases, the design goal should be to handle such error conditions gracefully and minimize any impact on the user experience.

Everything Fails #

Everything fails all the time: hard disks break, computers overheat, wires get cut, the power goes out, earthquakes damage buildings. Because of all this, no single device should be considered fault-tolerant.

It is a guarantee that something in your setup will eventually fail. At large enough scale, something is always failing.

The reasoning is simple: hardware is physical, subject to the laws of physics, and thus it breaks all the time. This is a mantra that anyone designing software systems in the cloud must always keep in mind. It is also one of the design principles on which Azure cloud solutions are built, and it is not a problem that Azure eliminates for its customers.

Something or someone has to deal with all these uncertainties. Ideally, cloud solutions providers such as Azure or AWS would solve problems transparently for their customers and would take care of providing a failure-proof environment, but that’s just not how it works.

As Reed Hastings said during a recent keynote speech:

“We are still in the assembly language phase of cloud computing.”

Coping with failure is, for the most part, left in the hands of cloud computing users, and software architects must take these inevitable failures very seriously. Design for failure must be deliberate: assume that the components in a system will in fact fail, and plan backwards from those failures to address them.

When this backwards planning is done, an important advantage is discovered: Designing a solution for reliability leads to a path that, with little additional cost and effort, solves scalability issues as well.

Build Resilient Applications #

There are two main reasons that we design applications for failure.

  • The first reason is User Experience. It’s no secret that you will have user attrition and lost revenue if you cannot shield your end users from issues outside their control.
  • The second reason is Business Services. All business-critical systems require resiliency, and the difference between 99.7% uptime and 99.99% can be hours of lost revenue or interrupted business services. Over a month serving, say, 1 billion requests, 99.7% uptime allows more than 2 hours of downtime versus roughly 4 minutes at 99.99% (the short calculation after this list shows where these numbers come from). Ouch!
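
To make the arithmetic concrete, here is a minimal sketch (assuming a 30-day month) that converts an uptime percentage into the monthly downtime budget it allows:

```python
# Convert an uptime percentage into the monthly downtime it allows (30-day month assumed).
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def downtime_minutes(uptime_percent: float) -> float:
    """Minutes of allowed downtime per month at the given uptime percentage."""
    return MINUTES_PER_MONTH * (1 - uptime_percent / 100)

print(downtime_minutes(99.7))   # ~129.6 minutes, a bit over 2 hours
print(downtime_minutes(99.99))  # ~4.3 minutes
```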

Werner Vogels, the CTO of Amazon Web Services, once said at a past AWS re:Invent:

“Everything fails, all the time.”

It’s a devastating reality and it’s something we all must accept. No matter how mathematically improbable, we simply cannot eliminate all failures. It’s how we reduce the impact of those failures that improves the overall resiliency of our applications.

Core Principles for Failure #

Graceful Degradation #

The way we reduce the impact of failure on our users and business is through graceful degradation. Conceptually it's very simple: we want to continue to operate in the face of a failure, in some degraded capacity. In keeping with the premise that applications fail all the time, you've probably experienced degraded services without even realizing it, and that is the ultimate goal.

Caching #

Caching is the first layer of defense when dealing with a failure.

Depending on your application's reliance on up-to-the-minute information, you should consider caching everything. It's very easy for developers to reject caching because they always want the freshest information for their users. However, when the difference between a happy customer and a sad one is a few minutes of staleness, serve the slightly stale data. As an example, imagine you have a fairly advanced web application. What can you cache? (For the database-record case, a brief cache-aside sketch follows the list.)

  • Full HTML pages with CloudFront
  • Database records with ElastiCache
  • Page fragments with tools such as Varnish
  • Remote API calls from your backend with ElastiCache
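
For the database-record case, here is a hedged cache-aside sketch. It assumes the redis-py client (ElastiCache for Redis speaks the same protocol); the endpoint, key scheme, TTL, and `query_database` helper are illustrative, not prescribed.

```python
import json

import redis  # redis-py client; ElastiCache for Redis is wire-compatible with it

cache = redis.Redis(host="my-cache.example.com", port=6379)  # hypothetical endpoint

def get_product(product_id):
    """Cache-aside read: serve slightly stale data rather than failing the user."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    record = query_database(product_id)         # hypothetical database call
    cache.set(key, json.dumps(record), ex=300)  # keep it for 5 minutes
    return record
```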

Retry #

As applications get more complex, we rely on more external services than ever before. Whether it's a third-party service provider or your own microservices architecture at work, failures are common and often transient.

A common pattern for dealing with transient failures on these types of requests is to implement retry logic. Using exponential backoff or a Fibonacci sequence, you can retry for some time before eventually throwing an exception. It's important to fail fast and avoid triggering rate limiting on the source, so don't retry indefinitely.
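
A minimal sketch of that retry logic, assuming a hypothetical `TransientError` exception and a generic remote call: back off exponentially with a little jitter, and give up after a fixed number of attempts rather than retrying forever.

```python
import random
import time

def call_with_retry(func, max_attempts=5, base_delay=0.2):
    """Retry a transient-failure-prone call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:  # hypothetical exception raised for retryable failures
            if attempt == max_attempts:
                raise  # give up and fail fast rather than retrying forever
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)

# Usage: call_with_retry(lambda: fetch_order_status(order_id))  # hypothetical call
```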

Rate Limiting #

In the case of denial-of-service attacks, self-imposed or otherwise, your primary defense is rate limiting based on context.

You can limit the number of requests to your application based on user data, source address, or both. By imposing a limit on requests, you can improve performance during a failure by reducing both the actual load and the additional load generated by your retry logic.

Also consider using exponential backoff or a Fibonacci increase to help mitigate particularly demanding callers. For example, during a peak in demand that cannot be met immediately, a reduction in load gives your application's infrastructure time to respond (think auto scaling) before it fails completely.
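
One common way to impose such a per-client limit is a token bucket. The sketch below is a simplified in-process version; a real deployment would typically keep the counters in a shared store (such as Redis) so all instances enforce the same limits.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: roughly `rate` requests/second, bursting up to `capacity`."""

    def __init__(self, rate=5.0, capacity=10.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)
        self.last_seen = defaultdict(time.monotonic)

    def allow(self, client_id):
        now = time.monotonic()
        elapsed = now - self.last_seen[client_id]
        self.last_seen[client_id] = now
        # Refill tokens for the time that has passed, capped at the bucket capacity.
        self.tokens[client_id] = min(self.capacity,
                                     self.tokens[client_id] + elapsed * self.rate)
        if self.tokens[client_id] >= 1:
            self.tokens[client_id] -= 1
            return True
        return False  # caller should respond with HTTP 429 Too Many Requests

# Usage: limiter = TokenBucket(); if not limiter.allow(source_ip): reject the request.
```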

Fail Fast #

When your application is running out of memory, threads, or other resources, you can shorten recovery time by failing fast: return an error as soon as the problem is detected. Not only will your users be happier not waiting on your application to respond, but you will also avoid cascading the delay into dependent services.
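
As a rough illustration of the idea, the sketch below guards a bounded worker pool and rejects new work immediately when the pool is exhausted instead of letting callers queue up; the pool size and helper functions are illustrative.

```python
import threading

# A bounded pool of worker slots; the size and helpers here are illustrative.
worker_slots = threading.BoundedSemaphore(50)

def handle_request(request):
    """Reject immediately when the pool is exhausted instead of queuing the caller."""
    if not worker_slots.acquire(blocking=False):
        return error_response(503, "Server busy, please retry")  # hypothetical helper
    try:
        return process(request)  # hypothetical request handler
    finally:
        worker_slots.release()
```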

Static Fallback #

Whether you’re rate limiting or simply cannot fail silently, you’ll need something to fallback to. A static fallback is a way to provide at least some response to your end users without leaving them to the wind with erroneous error output or no response at all.

It’s always better to return content that makes sense to the context of the user and you’ve probably seen this before if you’re a frequent user of sites like Reddit or Twitter.

Fail Silently #

When all of your layers of protection have failed to preserve your service, it’s time to fail silently.

Failing silently is when you rely on your logging, monitoring, and other infrastructure to respond to your errors with the least impact to the end user. It's better to return a 200 OK with no content and log the error on the backend than to return a 500 Internal Server Error or similar HTTP status code, or, worse yet, a nasty stack trace/log dump.
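
A minimal sketch of that practice for a non-critical endpoint (the handler and helper names are illustrative, not from any particular framework): log the full error on the backend and return an empty, well-formed 200 to the client.

```python
import logging

log = logging.getLogger(__name__)

def get_recommendations(request):
    """Non-critical endpoint: on failure, log everything and return an empty 200."""
    try:
        # fetch_recommendations and json_response are hypothetical helpers
        return json_response(fetch_recommendations(request.user_id))
    except Exception:
        log.exception("Recommendations failed for user %s", request.user_id)
        return json_response([], status=200)  # empty but well-formed response
```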

Failing Fast as a Developer #

There are two patterns that you can implement to improve your ability to fail fast: Circuit Breaking and Load Shedding. Generally, you want to leverage your monitoring tools such as CloudWatch, along with your logs, to detect failure early and begin mitigating the impact as soon as possible. These two patterns are automation at its finest.

Circuit Breaking #

Circuit breaking is purposefully degrading service in response to failure events detected in your logging or monitoring system. You can use any of the degradation patterns mentioned above inside the circuit. Finally, by implementing health checks in your service, you can restore normal operation as soon as possible.
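
A stripped-down sketch of the pattern: count consecutive failures, short-circuit to a degraded fallback while the breaker is open, and probe the service again after a cool-down period to restore normal operation. The threshold and cool-down values here are assumptions.

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; probe again after `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()      # short-circuit: degrade instead of waiting
            self.opened_at = None      # cool-down elapsed: let one probe request through
        try:
            result = func()
            self.failures = 0          # a healthy response restores normal service
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()

# Usage: breaker.call(fetch_prices, lambda: CACHED_PRICES)  # hypothetical names
```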

Load Shedding #

Load shedding is a method of failing fast that occurs at the networking level. Like circuit breaking, you can rely on monitoring data to reroute traffic from your application to a static fallback that you have configured. For example, Route53 has failover support built right in, which lets you use this pattern right away.
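
As a hedged sketch of the Route53 route, the boto3 snippet below upserts a failover record pair: a PRIMARY record tied to a health check and a SECONDARY record pointing at a static fallback endpoint. The hosted zone ID, health check ID, and domain names are placeholders.

```python
import boto3  # assumes AWS credentials and the boto3 SDK are already configured

route53 = boto3.client("route53")

# Placeholders: substitute your own hosted zone, health check, and domain names.
HOSTED_ZONE_ID = "Z123EXAMPLE"
HEALTH_CHECK_ID = "abcd1234-example"

def failover_record(set_id, role, target, health_check_id=None):
    """Build an UPSERT change for one half of a PRIMARY/SECONDARY failover pair."""
    rrset = {
        "Name": "www.example.com",
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        rrset["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": rrset}

# Primary points at the application; secondary points at the static fallback.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        failover_record("app-primary", "PRIMARY", "app.example.com", HEALTH_CHECK_ID),
        failover_record("static-fallback", "SECONDARY", "fallback.example.com"),
    ]},
)
```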